Error-Driven Boolean-Logic-Rule-Based Learning for Mining Chat-room Conversations
Abstract
The ephemeral nature of human communication via networks today poses interesting and challenging problems for information technologists. The sheer volume of communication in venues such as email, newsgroups, and chat precludes manual techniques of information management. Currently, no systematic mechanisms exist for accumulating these artifacts of communication in a form that lends itself to the construction of models of semantics [5]. In essence, dynamic techniques of analysis are needed if textual data of this nature is to be effectively mined.

At Lehigh University we are developing a text mining tool for analysis of chat-room conversations. Project goals concentrate on the development of functionality to answer questions such as “What topics are being discussed in a chat-room?”, “Who is discussing which topics?” and “Who is interacting with whom?” The objective is to develop technology that can automatically identify such patterns of interaction in both social and semantic terms.

In this article we present our preliminary findings for a novel technique developed to identify threads of conversation in multi-topic, multi-person chat-rooms. This is the first step towards building models of social and semantic interaction. We term our technique Error-Driven Boolean-Logic-Rule-Based Learning (BLogRBL), a variation on Brill’s Transformation Based Learning [11] [12] [13]. As in Brill’s method, rules are automatically derived from templates during learning. It differs from Brill’s technique in that rules take the form of complex expressions of combinational logic. We report on the scope and design of our technique and discuss preliminary results.

1.0 Background And Motivation

The goal of the project is to develop a computational approach towards understanding social and semantic interactions in textual mediums.
Chat-room conversation has been identified as the communication medium of interest due to its increasing popularity and the need for research in the area. The 430 million daily instant messages on AOL’s network alone provide a treasure trove of knowledge [3]. Chat conversation is radically different from various other mediums due to its often informal nature. Existing text mining techniques rely on more structured, formal corpora containing research papers, abstracts, technical reports, etc. Approaches toward understanding the dynamics of chat conversation are limited, and as usage grows the need for automated analysis increases. Due to the dynamic nature of chat conversations, dynamic modeling of social interactions and their contextual topics is a genuine research challenge.

This research is being conducted at the behest of the Intelink intelligence network. Intelink is a secure military communications channel used for critical exchanges of information. Intelink’s goal is to monitor chat conversation over the network and map relationships between users and their topics of conversation to determine the appropriateness of usage and the effectiveness of the communication network. They are interested in information such as the frequency of employee communication, the topics discussed, conversational participants and the focus of the conversations.

This research has, however, applications beyond the scope of Intelink’s needs. In fact, the techniques under development apply to any organization with an internal communications network, and perhaps to Internet users in general: patrons of chat services such as AOL Instant Messenger (AIM) and IRC (Internet Relay Chat) could benefit from utilizing such a tool.

1.1 Modeling Social & Semantic Relations

The application under development at Lehigh University to model social and semantic interactions is the Social Semantic Builder (SSB).
The SSB is a relational modeling tool that utilizes the HDDI [6][2][1] text mining infrastructure, and models relationships between distinct conceptual and/or behavioral abstractions. The purpose of the SSB is to determine, analyze, and model the relationships and interactions between these abstract relational entities.

As an example, consider the domain of research papers. Abstractions within this application field include authors of the papers and the concepts they explore. Corpora are constructed and used to cluster instances of the abstract entities according to their co-relational properties. In this example, two separate models would be created, one of authors who write together and another one of concepts that are similar within the document space. Continuing the example, the SSB would combine these two models to create a meta-level model between research paper authors and their conceptual content. The metamodel could then be used to analyze and discover previously unknown relationships between authors and their content. This would allow questions such as “What topics are being discussed?”, “Who is discussing which topics?” and “Who is authoring with whom?” to be answered. Although we have presented the SSB in the context of research article authors and content, social and semantic modeling can take place in any domain involving multiple authors and content.

2.0 Application Domain: Chat

The Social Semantic Builder is a utility with a variety of applications. We have discussed authors and research papers, but there are also students and courses, journalists and newspaper articles, etc. The SSB is designed in such a fashion that its general relational structure could be deployed for mapping between various types of entities. Each application has its own particular issues that need to be addressed, and in this article we address those issues relevant to the analysis of chat conversations.
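As a rough illustration of the general relational structure described above, the following Python sketch builds two small models from items that pair entities (authors, students, journalists) with their content. It is not the SSB's implementation: simple co-occurrence counting stands in for the clustering step, and the function name `build_models` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_models(items):
    """Build two toy models from (entities, concepts) pairs.

    Assumption: co-occurrence counts stand in for clustering.
    """
    co_entity = defaultdict(int)        # entity pairs seen in the same item
    entity_concepts = defaultdict(set)  # meta-level link: entity -> concepts
    for entities, concepts in items:
        # Count each unordered pair of entities once per item.
        for a, b in combinations(sorted(set(entities)), 2):
            co_entity[(a, b)] += 1
        # Link every entity to the concepts of the item it took part in.
        for entity in entities:
            entity_concepts[entity].update(concepts)
    return dict(co_entity), dict(entity_concepts)


# Hypothetical sample data: (authors of a paper, concepts in that paper).
papers = [
    (["Ann", "Bob"], ["text mining", "clustering"]),
    (["Bob", "Cy"], ["chat analysis"]),
]
co_authors, topics_by_author = build_models(papers)
```

Here `co_authors` answers "who is authoring with whom?" and `topics_by_author` answers "who is discussing which topics?". The same function would accept students and courses, or chat participants and conversation topics, without change.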
Some of the questions particular to chat are “Who are the participants in a particular conversation?”, “What are they talking about?”, “How focused is their conversation?”, “How are the participants socially interacting?” and “What forms of language do they use to express themselves?” A user (such as Intelink) interested in such chat relational models would be able to use the SSB to extract this information. Chat conversational documents would be input into the SSB, and it would create models of chat participants and conversational topics. A user could then use the models to associate topics with participants, observing which participants discussed a particular topic. Information such as which participants were involved in discussions together, and the topic of those discussions, would be readily available. The basic questions of “What topics are being discussed in a chat-room?”, “Who is discussing which topics?” and “Who is interacting with whom?” can thus be answered.

2.1 Chat Input Issues

The SSB accepts input in XML form, utilizing the HDDI infrastructure for processing the documents. At present, the HDDI System processes the input and then constructs a collection of statistics for analysis. The models for the social and semantic domains are then created and linked together by the Social Semantic Model Builder. Some application domains, such as research papers and newsgroup postings, map easily to this input format. The authors are identified at the beginning, and the body of the document or posting can be tagged as content.

Chat conversation is not so structured. Chat is often a continuous medium with users entering and leaving a given chat room. Furthermore, even though a chat room may have many users logged in, not all of them may be participating. Those users who are involved do not all participate in the same discussion with all the other users.
Often there are several conversations simultaneously taking place between users, and a single participant may also be involved in multiple conversations at once. It is an extremely chaotic environment and at first glance seems to lack consistent structure. In this situation, various chat conversations are interlaced throughout multiple postings, and extracting “authors” and their content into single cohesive units for input to the SSB is a daunting task.

Furthermore, in our research we have observed that there are numerous categories of chat. Factors such as the number of participants, the topic(s) of chat, the familiarity of users with each other, etc. lead to radically different conversation styles. If a conversation is between acquaintances discussing a common topic, for example, the conversational flow tends to be informal with little attention paid to grammar. If the session is, however, a help session or a discussion medium for a focused topic in which the users don’t know one another, the conversation is typically focused and formal and a broader usage of vocabulary is observed.

Our objective is to partition chat data into collections of postings composed of two or more authors discussing a single topic, creating input (that we refer to as items) for analysis by the SSB. Thus, each item consists of postings relevant to a single topic, and the users who participated in that topic. Within this framework, the names/screen names of the posters identify authors and the postings identify content. A co-authorship relationship is defined between users based on the content of their postings, and separate content and author clusters are created by the SSB. Each SSB input item is thus a thread of conversation or discussion revolving around a single topic or a group of very similar topics.

1 Margaret A. Root defines postings as single messages entered into a network communication system (e.g., chat room or Usenet Newsgroup) [18].
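The notion of an item described above can be sketched as a small data structure. This is a minimal illustration only: the class and function names (`Posting`, `Item`, `partition`) are hypothetical, and the topic labels are assigned by hand here, whereas identifying them automatically is the problem the article addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    author: str  # screen name of the poster
    text: str    # content of the single message


@dataclass
class Item:
    topic: str
    postings: list = field(default_factory=list)

    def authors(self):
        # Distinct screen names that contributed to this item.
        return sorted({p.author for p in self.postings})

    def is_thread(self):
        # An item must contain postings from two or more authors.
        return len(self.authors()) >= 2


def partition(labeled_postings):
    """Group (topic, Posting) pairs into items; keep items with >= 2 authors."""
    by_topic = {}
    for topic, posting in labeled_postings:
        by_topic.setdefault(topic, Item(topic)).postings.append(posting)
    return {t: item for t, item in by_topic.items() if item.is_thread()}


# Hand-labeled sample log (the labels are an assumption for illustration).
log = [
    ("ice cream", Posting("Anne", "I like ice-cream")),
    ("fruit", Posting("Jane", "Fruit is my favorite")),
    ("fruit", Posting("Anne", "Fruit's cool.")),
    ("ice cream", Posting("Joe", "I like ice cream too.")),
    ("diet", Posting("Tom", "I'm a vegetarian")),
]
items = partition(log)
```

In this sketch the "diet" posting is dropped because only one author contributed to it, while Anne appears in both remaining items, reflecting a participant involved in multiple conversations at once.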
3.0 Conversational Flow As Threads

As noted, in online chat environments there are often multiple discussions taking place, and within a particular room or channel authors will participate in multiple discussions, or items [9]. As a result, in the log of a chat session various items overlap as content from multiple discussions is interlaced. We thus define a thread or item as a collection of multiple authors’ postings grouped together by semantic similarity. Intuitively, a thread should consist of postings that share a similar meaning that is dissimilar from other postings. Clearly, a thread should be composed of the postings of at least two authors.

In order to be discernible, a thread must contain sufficient semantics to discriminate it from other threads. The challenge lies in

Figure 1: A Sample Chat Conversation
1. Mike: What kind of food do you like to eat?
2. Anne: I like ice-cream
3. Tom: I'm a vegetarian
4. Jane: Fruit is my favorite
5. Anne: Fruit’s cool.
6. Anne: I had a banana for breakfast
7. Joe: I like ice cream too.
8. Mike: My favorite ice-cream flavor
Similar resources
Text Categorization Approach For Chat Room Monitoring
The Internet has been utilized in several real life aspects such as online searching, and chatting. On the other hand, the Internet has been misused in communication of crime related matters. Monitoring of such communication would aid in crime detection or even crime prevention. This paper presents a text categorization approach for automatic monitoring of chat conversations since the current m...
Chat mining: Automatically determination of chat conversations' topic in Turkish text based chat mediums
Mostly, the conversations taking place in chat mediums bear important information concerning the speakers. This information can vary in many fields such as tendencies, habits, attitudes, guilt situations, and intentions of the speakers. Therefore, analysis and processing of these conversations are of much importance. Many social and semantic inferences can be made from these conversations. In d...
Mining Chat-room Conversations for Social and Semantic Interactions
The ephemeral nature of human communication via networks today poses interesting and challenging problems for information technologists. The Intelink intelligence network, for example, has a need to monitor chat-room conversations to ensure the integrity of sensitive data being transmitted via the network. However, the sheer volume of communication in venues such as email, newsgroups, and chat ...
A Learning-Based Approach for the Identification of Sexual Predators in Chat Logs
The existence of sexual predators that enter into chat rooms or forums and try to convince children to provide some sexual favour is a socially worrying issue. Manually monitoring these interactions is a way to attack this problem. However, this manual approach simply cannot keep pace because of the high number of conversations and the huge number of chatrooms or forums where these conversation...
Theory and Algorithms for Information Extraction and Classification in Textual Data Mining
Regular expressions can be used as patterns to extract features from semi-structured and narrative text [8]. For example, in police reports a suspect’s height might be recorded as “{CD} feet {CD} inches tall”, where {CD} is the part of speech tag for a numeric value. The result in [1] shows us that regular expressions could have higher performance than explicit expressions in some applications ...
Publication date: 2002